For a time series to be stationary, it should look roughly the same at any point in time: no trend and no seasonality. Cyclic behaviour, however, is allowed in a stationary time series.
A: does not look stationary because it has an obvious exponentially declining trend.
B: does not look stationary because it shows rather obvious seasonality. It is somewhat hard to tell, because the fluctuations could also be aperiodic cycles. Since we have no information on what kind of time series this is, I cannot say with certainty whether the pattern is seasonal or cyclic, but I conclude that it is non-stationary because of its apparent seasonality.
C: looks stationary
D: does not look stationary because of obvious seasonality and trend.
E: looks stationary at first glance. However, it also looks as if it has some seasonality, given the two low points around Dec 2014 and Dec 2018. Perhaps this series is affected by some four-yearly event; I would need more information about what the series measures to address this further. I would nonetheless say this is more likely cyclic behaviour, something we cannot know in advance will happen, and therefore conclude that the time plot is stationary.
F: looks stationary, as it shows no trend, seasonality or cyclic behaviour.
ACF plot 2 should belong to time plot A because the data has an obvious trend, which makes the autocorrelations at small lags large and positive and lets them decrease slowly over larger lags.
ACF plot 1 should belong to time plot C, because it is the clearest stationary time series and should therefore have the cleanest ACF plot, with autocorrelations dropping quickly to within the significance bounds.
ACF plot 6 should belong to time plot E, because it is stationary with a slight resemblance of seasonality, which could push some autocorrelations close to the significance limits.
ACF plot 5 should belong to time plot D because of the obvious seasonality and trend, both of which the ACF plot should capture: a slow decrease over the lags due to the trend, and a scalloped shape due to the seasonality.
ACF plot 4 should belong to time plot B because of its seasonality, which can also be seen in the ACF plot.
ACF plot 3 should belong to time plot F.
# loading packages and importing the data
library(fpp3)
house <- readRDS("attachments/houseprices.rds")
# looking at the data
head(house)
## # A tsibble: 6 x 3 [1Q]
## quarter priceindex seasonally_adjusted
## <qtr> <dbl> <dbl>
## 1 2005 Q1 53.3 52.5
## 2 2005 Q2 54.4 54
## 3 2005 Q3 56.1 55.9
## 4 2005 Q4 55.3 56.7
## 5 2006 Q1 61 60.2
## 6 2006 Q2 63 62.5
# plot the columns priceindex
house %>% autoplot(priceindex)
The time plot displays a time series that increases in a roughly linear manner, with a dip around 2008, mainly because of the financial crisis.
The definition of an additive decomposition is:
\[ \begin{equation} \tag{1.1} y_t = S_t + T_t + R_t \end{equation} \] The columns map to the components as follows: \(y_t\) corresponds to priceindex, while seasonally_adjusted corresponds to the seasonally adjusted series \(y_t - S_t = T_t + R_t\).
To create my own seasonally adjusted price index I am performing decomposition.
# fitting four decomposition models: classical, STL, X-11 and SEATS
dcmp <- house %>%
model(
classical = classical_decomposition(priceindex, type = "additive"),
stl = STL(priceindex),
x11 = X_13ARIMA_SEATS(priceindex ~ x11()),
seats = X_13ARIMA_SEATS(priceindex ~ seats())
)
dcmp %>% select(x11) %>% components() %>% autoplot()
Above I used the X-11 method, a decomposition used by official
statistics agencies, to calculate a seasonally adjusted price index.
After fitting the classical, STL, X-11 and SEATS decompositions, I find
all models except the classical one to be good enough: each captures
the 2008 decline in priceindex well, and the remainder is small. Since
we have quarterly data, I wanted to use a decomposition from official
statistics agencies and landed on the X-11 method. Next I compare it
to the seasonally_adjusted price index from the data frame.
dcmp %>%
select(x11) %>%
components() %>%
as_tsibble() %>%
left_join(house, by = c("quarter")) %>%
select(quarter, season_adjust, seasonally_adjusted) %>%
ggplot() +
geom_line(aes(x = quarter, y=season_adjust), colour = "blue") +
geom_line(aes(x = quarter, y=seasonally_adjusted), colour = "black")
The figure above plots the seasonally_adjusted column provided in the
data (black) and the season_adjust component extracted from the X-11
decomposition (blue). The two series follow the same path and look
very similar.
The question is a bit hard to interpret: it is unclear whether I may choose a single quarter anywhere in the time series, or whether I should pick one quarter of the year to buy/sell in every year.
Under the first interpretation, and with no limit on how long I can hold the house, I would definitely buy at the very beginning of the time series and sell at the very end, since the index increases over the period.
For the second interpretation, I make a subseries plot to see how the time series varies across quarters.
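Such a subseries plot could be produced along these lines (a minimal sketch, assuming `house` is the quarterly tsibble loaded above and the fpp3 packages are attached):

```r
# subseries plot: one panel per quarter, per-quarter means drawn in blue
house %>% gg_subseries(priceindex)
```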
The subseries plot motivates buying houses in the first quarter and
selling in the second quarter, judging by the per-quarter mean of
priceindex (the blue line).
The season plot is hard to interpret and to use to answer the question.
However, we can see that in order not to lose value between buying and
selling the house, you have to own it for a couple of years. Following
the takeaway from the subseries plot, buying in the first quarter and
selling in the second, we would lose value in some years; but if we
keep the house through to the next year, we can be fairly sure the
value will not have fallen below the initial purchase price.
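The season plot referred to above could be generated with, for instance (a sketch, assuming fpp3 is attached):

```r
# season plot: one line per year, quarters on the x-axis
house %>% gg_season(priceindex)
```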
I import the data, limit it to vessels transiting the canal going north, remove the transit_bound column for simplicity, format the year-week column with yearweek(), and convert the result to a tsibble with vessel_type as key and year_week as index.
# Reading in the data
df <- readRDS("attachments/panamacanal.rds") %>% ungroup()
# Filtering and formatting
panama <- df %>%
filter(transit_bound == "North") %>% # limiting to transit_bound North
select(-transit_bound) %>% # Removing transit_bound column (do not need it anymore)
mutate(
year_week = yearweek(year_week) # formatting to year_week
) %>%
as_tsibble(key = c("vessel_type"), index = "year_week")
Next I create exploratory graphics and comment on the plots.
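A minimal sketch of such an exploratory plot (assuming `panama` is the tsibble built above and fpp3 is attached):

```r
# waiting time over the whole period, one line per vessel type
panama %>% autoplot(waiting_time)
```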
The plot above shows waiting time for all vessel types over the period
of analysis. It is very hard to distinguish any seasonality or trend
here. The main takeaway is perhaps the spike around 2020 W10 for the
top four vessel types.
The plot above overlays the different years for each vessel type over
the weeks of the year. It is impossible to see any seasonality in the
data here. One interesting feature is all the gaps in the data,
especially for the “Oil tanker” vessels. Based on the plots above,
there is no evidence of seasonality or trend in waiting_time.
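The year-over-year plot discussed above could be produced with, e.g. (a sketch, assuming fpp3 is attached):

```r
# season plot: one line per year for each vessel type, weeks on the x-axis
panama %>% gg_season(waiting_time)
```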
Further I will create a separate data object with only vessel type Bulk carrier
# creating separate dataset only containing Bulk carrier
bulk <- panama %>%
filter(vessel_type == "Bulk carrier")
The figure above plots the waiting time for bulk carriers, and we can
see a spike in the data in 2020 week 10. Let's make a season plot to
see if there are any interesting characteristics there.
The season plot shows no clear relationship between the years in the data.
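The summary table below could be produced along these lines (a sketch, assuming `panama` is the tsibble built above):

```r
# summary statistics of waiting time per vessel type
panama %>%
  as_tibble() %>%
  group_by(vessel_type) %>%
  summarise(
    mean   = mean(waiting_time, na.rm = TRUE),
    max    = max(waiting_time, na.rm = TRUE),
    min    = min(waiting_time, na.rm = TRUE),
    st.dev = sd(waiting_time, na.rm = TRUE)
  )
```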
## # A tibble: 8 × 5
## vessel_type mean max min st.dev
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Bulk carrier 54.0 187. 19.2 24.7
## 2 Chemical tanker 51.1 195. 15.0 27.4
## 3 Container 6.16 24.2 1.52 3.70
## 4 General cargo 33.2 135. 8.52 16.9
## 5 LPG Tanker 32.0 105. 4.17 15.9
## 6 Oil tanker 59.5 199. 11.1 39.2
## 7 Refrigerated bulk 15.4 44.1 3.65 7.21
## 8 Vehicle 11.3 32.0 2.34 5.54
The table above reports the mean, max, min and standard deviation of waiting time for each vessel type. (1) The vessel type with the shortest average waiting time, and which therefore would be most willing to pay for fast_track, is “Container”. (2) The vessel type with the highest variation in waiting times, measured by standard deviation, is “Oil tanker”.
The most obvious argument for the log transform is that it keeps every step of the forecasting on a common, constrained scale: it stabilises the variance and guarantees that the back-transformed forecasts cannot be negative, while remaining easy to interpret (changes on the log scale are approximately percentage changes).
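The positivity guarantee can be illustrated with a tiny sketch: whatever the point forecasts are on the log scale, the back-transformation exp() maps them into (0, Inf):

```r
# hypothetical point forecasts on the log scale (illustrative values)
log_fc <- c(-2.3, 0, 1.5)
# back-transformed forecasts: exp() is strictly positive for any real input
exp(log_fc)
```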
In this task, we consider modelling log(waiting_time) as an ARIMA(p,d,0) or an ARIMA(0,d,q) model. I therefore determine d using unit root tests.
First I use unitroot_ndiffs(), which applies a sequence of KPSS tests to determine the appropriate d. The result is d = 1.
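A sketch of the call (assuming `bulk` is the Bulk carrier tsibble created above):

```r
# KPSS-based test for the number of first differences needed
bulk %>% features(log(waiting_time), unitroot_ndiffs)
```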
## # A tibble: 1 × 2
## vessel_type ndiffs
## <chr> <int>
## 1 Bulk carrier 1
Second I use unitroot_nsdiffs(), which uses the measure of seasonal strength to determine the appropriate number of seasonal differences D. The result is D = 0, so no seasonal differencing is needed.
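A sketch of the corresponding call:

```r
# seasonal-strength-based test for the number of seasonal differences needed
bulk %>% features(log(waiting_time), unitroot_nsdiffs)
```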
## # A tibble: 1 × 2
## vessel_type nsdiffs
## <chr> <int>
## 1 Bulk carrier 0
To determine p and q from the ACF and PACF plots, the data must come from an ARIMA(p,d,0) or ARIMA(0,d,q) model, which is assumed in the task description at the beginning of page 7. In the ACF plot I see a decaying, somewhat sinusoidal pattern, although it decays slowly, with the last significant spike at lag 10. The PACF plot has its last significant spike at lag 6. Based on this, I would suggest p = 6 and q = 10.
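The ACF and PACF of the once-differenced log series (d = 1 per the unit root test) could be inspected with, e.g. (a sketch):

```r
# time plot, ACF and PACF of the differenced log series in one figure
bulk %>% gg_tsdisplay(difference(log(waiting_time)), plot_type = "partial")
```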
fit <- train %>%
model(ARIMA(log(waiting_time)))
fit %>%
gg_tsresiduals()
#report(fit)
The plot above shows the residuals of the ARIMA model fitted with the automatic procedure. The ACF has a single significant spike at one lag, which is not enough evidence to conclude that the residuals differ from white noise.
## Series: waiting_time
## Model: ARIMA(1,1,1)
## Transformation: log(waiting_time)
##
## Coefficients:
## ar1 ma1
## 0.5819 -0.9400
## s.e. 0.0971 0.0506
##
## sigma^2 estimated as 0.1052: log likelihood=-40.18
## AIC=86.37 AICc=86.54 BIC=95.17
Above, the report of the model from (e) is printed. It gives an ARIMA(1,1,1) model on the log scale, with the following equations:
\[ \begin{equation} \tag{1.2} y_t = \log(waiting\_time_t) \end{equation} \] \[ \begin{equation} \tag{1.3} (1-\phi_1 B)(1-B)y_t = (1+\theta_1 B)\epsilon_t \end{equation} \] \[ \begin{equation} \tag{1.4} \text{where } \epsilon_t \sim N(0,\sigma^2), \ \phi_1 = 0.58, \ \theta_1 = -0.94 \ \text{and} \ \sigma^2 = 0.11 \end{equation} \]
## df AIC
## forward 10 91.32917
## backward 6 84.58512
Backward stepwise regression works as follows: it starts from the model containing all potential predictors and removes one predictor at a time, keeping the change that improves the accuracy measure (AIC in our case) the most, and stopping when no removal improves it further. If the number of potential predictors is too large relative to the number of observations, the full model cannot be fitted and backward selection will not work; forward selection should be used instead. Forward stepwise regression starts from the other end: it begins with a model containing only the intercept and adds one predictor at a time, again keeping the addition that improves the AIC the most. The two procedures do not necessarily arrive at the same model specification because they approach the predictor set from opposite ends and therefore combine predictors differently along the way. Based on the AIC in our example, the model from backward stepwise regression is the better one.
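With a plain data frame, the two procedures can be sketched with base R's step(); the data frame name `train_df` and its predictor set are illustrative, not the objects used above:

```r
# backward: start from the full model, drop predictors while AIC improves
full <- lm(log(waiting_time) ~ ., data = train_df)
backward <- step(full, direction = "backward", trace = FALSE)

# forward: start from the intercept-only model, add predictors while AIC improves
null <- lm(log(waiting_time) ~ 1, data = train_df)
forward <- step(null, direction = "forward",
                scope = formula(full), trace = FALSE)

# compare the two selected models
AIC(forward, backward)
```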
## Warning: Removed 1 row containing missing values (`geom_line()`).
The innovation residuals look like white noise.
Comparing model performance of the backward stepwise model selection and the ARIMA model from (e)
## # A tibble: 2 × 11
## .model r_squared adj_r_squared sigma2 statistic p_value df log_lik AIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 backw… 0.444 0.427 0.102 26.7 2.58e-16 5 -36.3 -310.
## 2 arima NA NA 0.105 NA NA NA -40.2 86.4
## # ℹ 2 more variables: AICc <dbl>, BIC <dbl>
In terms of AIC the backward model performs much better, although AIC values are only strictly comparable between models estimated with the same likelihood on the same data, so the comparison between the regression and ARIMA models should be read with care.
Forecasting with the backward stepwise model and plotting the forecast together with the observed values.
fit2 %>%
forecast(test) %>%
autoplot(bulk)
This is an ex-ante forecast, since we only use the data in the training
set, up until week 35 of 2021, to build the model and to forecast; the
predictor values therefore have to be forecast as well. If we instead
observed and used later information on the predictors, it would be an
ex-post forecast.
In the code below I fill the gaps in the data with NAs. Then I use an automatic ARIMA model to estimate the missing values and insert them into the data with interpolate(). Lastly, I split the data into train and test sets.
# filling in missing values
panama_miss <- panama %>%
fill_gaps()
# check in which columns the missing values are
# t <- panama_miss %>% summarise(across(everything(), ~ sum(is.na(.))))
# colSums(t[,-1])
# there are missing values in every column
# start a parallel cluster (note: the model calls below do not actually
# use this cluster, so it could be dropped together with stopCluster())
cores <- detectCores()
cluster <- makeCluster(cores)
# fitting an ARIMA model for every column and interpolating missing values
wait <- panama_miss %>%
group_by(vessel_type) %>%
model(ARIMA(log(waiting_time))) %>%
interpolate(panama_miss)
trans <- panama_miss %>%
group_by(vessel_type) %>%
model(ARIMA(log(transits_lag_1))) %>%
interpolate(panama_miss) %>%
suppressWarnings()
sog <- panama_miss %>%
group_by(vessel_type) %>%
model(ARIMA(log(sog_lag_1))) %>%
interpolate(panama_miss)
gatun_level <- panama_miss %>%
group_by(vessel_type) %>%
model(ARIMA(log(gatun_level_mt_lag_1))) %>%
interpolate(panama_miss)
culebra <- panama_miss %>%
group_by(vessel_type) %>%
model(ARIMA(log(visibility_culebra_lag_1))) %>%
interpolate(panama_miss) %>%
suppressWarnings()
limon <- panama_miss %>%
group_by(vessel_type) %>%
model(ARIMA(log(visibility_limon_lag_1))) %>%
interpolate(panama_miss) %>%
suppressWarnings()
corozal <- panama_miss %>%
group_by(vessel_type) %>%
model(ARIMA(log(visibility_corozal_lag_1))) %>%
interpolate(panama_miss) %>%
suppressWarnings()
gatun <- panama_miss %>%
group_by(vessel_type) %>%
model(ARIMA(log(visibility_gatun_lag_1))) %>%
interpolate(panama_miss)
wait_lag <- panama_miss %>%
group_by(vessel_type) %>%
model(ARIMA(log(waiting_time_lag_1))) %>%
interpolate(panama_miss)
panama_fill <- wait %>%
left_join(trans,by = c("vessel_type", "year_week"))%>%
left_join(sog,by = c("vessel_type", "year_week"))%>%
left_join(gatun_level,by = c("vessel_type", "year_week"))%>%
left_join(culebra,by = c("vessel_type", "year_week"))%>%
left_join(limon,by = c("vessel_type", "year_week"))%>%
left_join(corozal,by = c("vessel_type", "year_week"))%>%
left_join(gatun,by = c("vessel_type", "year_week"))%>%
left_join(wait_lag,by = c("vessel_type", "year_week"))
#t <- panama_fill %>% summarise(across(everything(), ~ sum(is.na(.))))
#colSums(t[,-1])
# zero nas
# close cluster
stopCluster(cluster)
# making train and test set
test2 <- panama_fill %>%
filter(year_week >= yearweek("2021 W36"))
train2 <- panama_fill %>%
filter(year_week < yearweek("2021 W36"))
# short look at the data
panama_fill %>%
gg_season(waiting_time)
We can see that the gaps in the data are now filled, in contrast to the
season plot in task 3a).
models <- train2 %>%
model(
arima = ARIMA(log(waiting_time)),
backward = TSLM(log(waiting_time) ~
log(waiting_time_lag_1) +
log(transits_lag_1) +
log(visibility_gatun_lag_1) +
log(visibility_limon_lag_1)
),
naive = NAIVE(waiting_time),
drift = NAIVE(waiting_time ~ drift())
) %>%
mutate(
comb_non_benchmark = (arima + backward) / 2,
comb_all = (arima + backward + naive + drift) / 4
)
fc <- models %>%
forecast(new_data = test2)
options(dplyr.summarise.inform = FALSE)
accuracy(fc, test2) %>%
as_tibble() %>%
group_by(vessel_type, .model) %>%
summarise(
RMSE = min(RMSE)
) %>%
filter(RMSE == min(RMSE))
## # A tibble: 8 × 3
## # Groups: vessel_type [8]
## vessel_type .model RMSE
## <chr> <chr> <dbl>
## 1 Bulk carrier backward 13.4
## 2 Chemical tanker backward 21.4
## 3 Container naive 2.07
## 4 General cargo backward 11.2
## 5 LPG Tanker comb_all 16.3
## 6 Oil tanker comb_non_benchmark 42.8
## 7 Refrigerated bulk backward 7.42
## 8 Vehicle comb_all 4.03
The table above lists, for each vessel type, the model that forecasts best in terms of test-set RMSE.